Human-centric Computing and Information Sciences volume 13, Article number: 18 (2023)
Despite active research on the design of presentation tools after the emergence of slideshow presentations, there is a lack of research findings on appropriate modalities of interactions for controlling slides in an exploratory manner. The objective of this study is to find the appropriate modality for features of controlling slides and to design usable features. This study used an iterative design process based on the Wizard-of-Oz (WoZ) prototype and participatory design (PD), which was divided into three phases. In the first phase, the participants were directly involved in the ideation process, and they created an initial design set. In the second phase, the initial prototype was evaluated by the participants with WoZ, focusing on the scope of co-speech gesture interactions. Finally, the usability of the final design set was evaluated, and it was demonstrated that the proposed design features were usable in terms of naturalness, controllability, efficiency of information delivery, and efficiency of resource use. The results also showed that the verbal modality was more dominant, whereas many previous studies focused on creating gesture-based systems. This research is expected to provide guidance for designing hands-free presentations with co-speech gesture, and to benefit PD research conducted with WoZ.
Co-speech Gesture Interaction, Participatory Design, Wizard of Oz, Slideshow Presentation Tool
After the emergence of slideshow presentation programs such as Microsoft PowerPoint and Apple Keynote, the functionality and affordances of presentation slides have greatly improved. These presentation tools offer various features for computer-animated slides that allow speakers to interact with audiences and/or slides through visual effects, content, and control [1, 2]. Furthermore, diverse research studies have examined interaction mechanisms that help presentations deliver speakers’ messages better. Some have worked on interactive navigation functions to overcome the limitation of the linear navigation structure that is predominant in current presentation tools [3–6]. For instance, zoomable user interfaces for navigation in presentations are used in the commercial program Prezi [7–9], and other research groups have proposed hyperlinked slideshow applications.
The control of slide navigation is another important issue in research on presentation programs. Currently, a remote controller is the most common controller; however, it can only support a linear navigation structure. Several studies have examined alternative ways to mediate the control of slides and components. One line of research used paper as the main medium for controlling slides, such as a slide card with a barcode printed on it, and a specially treated paper that recognizes the position of a special pen. Another mainstream approach is to use human gestures as a natural medium for controlling slides [11, 12]. In these gesture-based systems, certain intuitive gestures signal and perform the corresponding controls defined a priori.
Across these studies, diverse modalities mediate the control of slides: modalities requiring extra handheld devices, such as laser pointers, remote controllers, digitized papers, and interactive smartboard pens, and natural human modalities, such as gestures, speech, and co-speech gesture. While these modalities have their own characteristics, there is a lack of research on finding the appropriate modality for slide control. Previous research has mainly relied on the researcher’s insights and heuristics for choosing the modality.
The objective of this research was to create usable, intuitive designs with features that are natural, controllable, and efficient in information delivery and resource consumption. We attempted to investigate an appropriate modality for slide control without any bias by assuming that any modality can be a candidate. Once we had identified the appropriate modality, we designed features for controlling the slides in that modality through an iterative design process. This study implemented a low-fidelity prototype and tested it in a Wizard-of-Oz (WoZ) setting. Creating and testing high-fidelity prototypes using cutting-edge technologies such as artificial intelligence (AI) would be time-consuming and expensive. Furthermore, most issues raised during usability tests might concern the AI’s performance, making it difficult to deeply examine the experiences and needs in terms of interactions. Therefore, this research tried to move beyond technology constraints and anticipate future user reactions, allowing the design aspects to be specified in advance. These efforts can be utilized as reference data for design in the future.
Studies on Diverse Modalities
Although laser pointers and remote controllers have been used for a long time to control presentation slides, many researchers have sought better modalities to enhance controllability. Interactive smartboards have been employed to improve the level of interactivity during presentations, particularly in learning environments. In these systems, digital markers are used to connect users to a system.
Some researchers have used digitalized paper as a medium for presentation control [3, 4]. An early study is the Palette program, which provides control of random navigation among slides. Prepared slides are printed on a special paper card with a barcode for presenters, who can change the order of the slides by reading the palette barcode using a reader. Singer and Norrie suggested a different paper-based system, which allows presenters to annotate slides through special papers and a digital pen: the presenter’s annotation and button-click actions are sent to the host computer controlling the corresponding slides.
While these modalities are based on physical devices that give users direct control, another stream of research has focused on natural human modalities that human beings use daily to communicate, such as gestures. Human gestures have been examined as natural mechanisms for presentation control [11, 12, 14]. Cao et al. evaluated three different modalities of presentation control, namely laser pointer, mouse/keyboard, and bare-hand gestural controls, using the WoZ. The research revealed that bare-hand gesture control received the highest score in all quantitative ratings on clearness, efficiency, and attractiveness. Fourney et al. developed a slideshow presentation tool with gestural interfaces to resolve problems related to navigation and annotation in presentation tools.
To develop these gesture-based systems, it is important to define triggers, which are the user signals that make the system execute a functionality. The triggers in many gesture-based systems are designed by system developers based on their insights and available technologies. Negroponte suggested the use of human-behavior-based approaches instead of choosing easily learnable, memorable, and efficient triggers.
In contrast to the gestural modality, the verbal modality has not received much attention for presentation systems, although it is the most dominant modality of human communication. This could be because of the considerable interruption caused by speech triggers and the limited reliability of speech technology. However, it might be time to consider speech as the main modality, given the successful deployment of commercial speech interaction systems such as Apple Siri, Google Assistant, and Microsoft Cortana.
The combination of these human modalities forms co-speech gesture. With the great interest that co-speech gesture processing has recently received in several fields including medicine, conversational agents [18, 19], and oral presentation, this modality could also be considered a candidate modality for such applications. In previous studies, however, the researchers chose one main modality, and, in that modality, they designed triggers or features for controlling slides or developed systems. This study attempted to investigate the appropriate modality for such an application in an exploratory manner, along with designing usable features.
Studies on Diverse Features of Slide Control
While previous studies targeted different modalities, they also addressed the different features of controlling slides. The Palette program targets nonlinear page navigation using special paper cards, and PaperPoint tackles the annotation along with nonlinear page navigation. In some studies, zooming features were designed and implemented [7–9]. Diverse features have been implemented in gesture-based systems, such as linear navigation, zooming, and video control, by mapping each feature to a different gesture [11, 12, 14, 15].
Research Context and Method
The main objectives of this research were to explore the appropriate modality supporting the user’s natural control of presentation, which can be the basis for the design of the next generation presentation tools and control features in that modality. Because effective presentation methods may differ depending on the field and content, it is necessary to minimize the potential effect of these fields and contents. Therefore, this research was conducted with a relatively homogeneous group of engineering graduate students in a university setting.
Because of the exploratory nature of this research objective, we employed both participatory design (PD) and WoZ as the main methods for data collection. First, we employed PD, in which end-users actively participate in the design process from an initial stage [21–24]. Unlike other design methods that mainly rely on researchers’ insights and heuristics, PD brings together the experiences of multiple stakeholders to gather insights into different aspects of a targeted solution through their participation as co-designers. However, PD can be problematic when participants do not know exactly what they want, or how to explain their tacit knowledge. To address this issue and help the participants develop design ideas, we used a layered elaboration technique that allows participants to contribute to ideas provided by others while also encouraging them to expand on those earlier ideas.
Second, we used WoZ, which helps researchers avoid the cost of full implementation by simulating the design features. In WoZ, users assume that they are using a real system, while a human operator reacts behind the scenes. WoZ was also used to collect users’ behavioral patterns with a target application. WoZ is widely used in studies on novel gestures [26, 27] and in discourse research [28–30]. Early computational linguists found that the linguistic characteristics of human-machine dialogs are different from those of human-human dialogs. Because research studies on intelligent language interfaces have mostly been conducted using corpus analysis, it is critical to collect data representing the characteristics of human-machine dialogs. Furthermore, data collected from the WoZ are useful for the development of dialog systems using data-driven machine learning algorithms because these algorithms require a realistic dataset [31, 32].
In this study, PD mainly supported initial design prototyping and design refinement, whereas WoZ was employed for usability testing (Fig. 1). First, an initial design was drawn from the participants (initial design prototyping). Many different PD techniques can be used to obtain initial design ideas. The resulting design was imitated by WoZ and tested for usability (usability test). This test, through WoZ, can give the participants opportunities to see the issues with their ideas. Once the usability test results were obtained, the design was refined (design refinement) according to usability. By dividing the entire number of iterations into three phases, we held a design workshop for each phase.
In the first phase, a design workshop was held to develop an initial design prototype for the slideshow presentation tool. This phase had two primary objectives: find the appropriate modality for controlling slides and derive initial ideas for designing a presentation tool. At this stage, our research approach was mainly exploratory, leaving participants in an open context where they could freely suggest and discuss ideas for the initial prototype with various features covering a wide range of presentations.
Six participants (n=6) were recruited for the design workshop in Phase 1. The participants were either researchers or doctoral students (three males and three females) from a research university with science and engineering majors. They reported that they made an average of two work-related presentations per month. The workshop included three sessions: introduction, presentation, and focus group interviews (FGIs).
The objective of the research was described to the participants, and consent forms were collected from all participants. An ice-breaking session was then conducted to help the participants feel more comfortable interacting with each other.
| Designs for interaction with audience (DIA) | DIA-1. Automatic translation of the presentation for audiences with different mother tongue | Automatic supports for bidirectional interactions between the speaker and the audience for clear delivery of information |
| | DIA-2. Functionality that automatically generates a QR code for an easy access to the information of the slide | |
| | DIA-3. Feedback of the comprehension level of the audience by understanding the facial expressions and movements | |
| | DIA-4. Listing the audiences who have questions and pressed the button | |
| Designs for interaction with environment (DIE) | DIE-1. Combining the slides into one for removing time waste for transitions | Elimination of resource wastes for controlling environmental settings |
| | DIE-2. Setting all equipment needed for the presentation by pressing just a single button | |
| | DIE-3. Alarming the speaker through watch vibration to inform the time left | |
| | DIE-4. Inclusion of videos in the file for easier management | |
| Designs for interaction with slide (DIS) | DIS-1. Moving into the previous, next or specific page indicated by presenter’s speech and gesture | Supports for controlling slides (medium) using natural modality both for clear information delivery, for efficient use of resources and the high controllability |
| | DIS-2. Controlling the play of videos in the slide using speech and gesture | |
| | DIS-3. Showing a pointer when directed by an arm | |
| | DIS-4. Zoom in/out according to presenter’s speech and gesture | |
| | DIS-5. Search references and contents indicated by presenter’s utterance | |
The main aim of the second phase was to help participants concretely experience and refine the usability of the key features derived in Phase 1. Central to this refinement process is the realization of the initial design through WoZ.
Table 2. Final list of the co-speech gesture feature set obtained from Phase 1
| Feature | Speech triggers | Gesture triggers |
| Page navigation (DIS-1) | ‘move to <SLIDE_SUBJECT>’, ‘go to <RELATIVE_PAGE_INDEX>’, ‘go to page <PAGE_NUM>’ | swiping a hand |
| Video control (DIS-2) | ‘play’, ‘stop’, ‘pause’, ‘forward’, ‘backward’, ‘next bookmark’, ‘previous bookmark’ | pointing for specifying a video |
| Pointer (DIS-3) | - | pointing to a spot on the screen |
| Zooming (DIS-4) | ‘zoom in’, ‘zoom out’, ‘zoom in here’ | pointing for specifying a spot to zoom in |
| Hyperlink (added) | ‘enter the hyperlink’, ‘exit’, ‘enter this hyperlink’ | pointing for specifying a hyperlink; scrolling for navigation in the link page |
| Object control (added) | ‘enlarge this picture’, ‘reduce the font size’, ‘move this image there’ | pointing for specifying an object; pointing for moving the selected object |
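The guided speech triggers in Table 2 are close to a small command grammar, so a minimal sketch of how they could be matched is given here. It is an illustration only, not the system used in the study (which was simulated by a human wizard). The intent names, slot names, and the words accepted for <RELATIVE_PAGE_INDEX> are assumptions, and the object control triggers are omitted for brevity.

```python
import re
from typing import Optional

# Guided speech triggers from Table 2 as regular expressions.
# Intent and slot names are hypothetical labels; the words accepted for
# <RELATIVE_PAGE_INDEX> (next, previous, first, last) are an assumption.
TRIGGERS = [
    ("page_navigation", r"go to page (?P<page_num>\d+)"),
    ("page_navigation", r"go to (?:the )?(?P<relative>next|previous|first|last)(?: page)?"),
    ("page_navigation", r"move to (?P<slide_subject>.+)"),
    ("video_control", r"(?P<action>play|stop|pause|forward|backward|next bookmark|previous bookmark)"),
    ("zooming", r"zoom (?P<direction>in|out)(?P<deictic> here)?"),
    ("hyperlink", r"(?P<action>enter) (?:the|(?P<deictic>this)) hyperlink"),
    ("hyperlink", r"(?P<action>exit)"),
]
COMPILED = [(intent, re.compile(pattern)) for intent, pattern in TRIGGERS]


def parse_trigger(utterance: str) -> Optional[dict]:
    """Map a guided utterance to an intent, or return None if nothing matches."""
    text = utterance.strip().lower().rstrip(".!")
    for intent, pattern in COMPILED:
        match = pattern.fullmatch(text)
        if match:
            slots = {k: v.strip() for k, v in match.groupdict().items() if v}
            # A deictic word ("here", "this") means a pointing gesture
            # has to supply the target of the command.
            needs_pointing = slots.pop("deictic", None) is not None
            return {"intent": intent, "slots": slots, "needs_pointing": needs_pointing}
    return None
```

For example, `parse_trigger("Zoom in here")` yields the zooming intent with `needs_pointing` set, which reflects the table's pairing of ‘zoom in here’ with a pointing gesture. Natural utterances such as “let’s watch the video, here” fall outside this grammar and would need a language understanding component.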
|Verbal modality||Gestural modality||Simultaneous multimodality||Natural/Guided|
| Feature | Naturalness (Phase 2) | Naturalness (Phase 3) | Controllability (Phase 2) | Controllability (Phase 3) | Efficiency of information delivery (Phase 2) | Efficiency of information delivery (Phase 3) | Efficiency of resource use (Phase 2) | Efficiency of resource use (Phase 3) |
| Page navigation | 4.1 (1.0) | 4.3 (0.9) | 4.2 (0.5) | 4.3 (0.9) | 3.4 (1.1) | 4.6 (0.7) | 3.7 (1.3) | 4.4 (0.7) |
| Video control | 3.0 (1.3) | 4.4 (0.7) | 3.2 (1.1) | 4.3 (0.7) | 4.2 (0.7) | 4.2 (0.9) | 4.6 (0.7) | 4.6 (0.6) |
| Pointer | 4.4 (0.7) | 4.3 (0.6) | 3.8 (1.3) | 4.3 (0.7) | 4.4 (0.4) | 4.4 (0.7) | 4.4 (1.1) | 4.3 (0.7) |
| Zooming | 3.6 (0.4) | 4.3 (0.6) | 4.0 (0.8) | 4.4 (0.8) | 3.8 (1.1) | 4.6 (0.8) | 3.2 (1.3) | 4.6 (0.8) |
| Hyperlink | 3.4 (0.9) | 4.7 (0.5) | 3.4 (1.1) | 4.3 (1.0) | 3.4 (1.4) | 4.1 (1.1) | 4.0 (0.8) | 4.1 (1.1) |
| Object control | 2.0 (0.8) | 3.7 (0.8) | 2.4 (0.7) | 3.9 (0.6) | 2.4 (1.4) | 3.6 (0.8) | 3.8 (1.3) | 3.6 (0.9) |
| Annotation | - | 3.5 (0.8) | - | 3.9 (0.9) | - | 3.6 (0.9) | - | 3.8 (0.9) |
From the FGI session of Phase 2, the participants primarily suggested speech-based, gesture-supported, natural, and simultaneous multimodal interactions.
This work was partially supported by the Centre for Innovation and Transfer of Natural Sciences and Engineering Knowledge of the University of Rzeszow, Poland.
Speeches vs. gestures
From the first phase, the participants in this study agreed that gestures alone were not enough to form a natural and usable interaction. Gestural expressions are better suited for expressing shapes and directions, but it is challenging to express other types of messages, such as “play the video.” By contrast, verbal expressions can easily contain these messages. One participant said, “It would be very hard to express yellow through gestures.” Furthermore, what the participants suggested was not that the verbal modality supports the gestural modality but the other way around, that is, the verbal modality dominates. This is consistent with previous work showing that the gestural modality mostly supplements the verbal modality in spatial dimensions. In the design specification (Table 3), four of the features (page navigation, pointer, zooming, and annotation features) used the gestural modality as the primary, while the others mainly used the verbal modality, obtaining the position of the target object from the gestural modality.
Both primary and auxiliary gestures were predominantly pointing gestures. The difference between the primary and auxiliary use of the gestural modality thus depended on whether the pointing gesture alone could express the presenter’s intention. The features tagged with the primary use of pointing gestures were the pointer and annotation features. If the speaker points at the screen, they have already expressed their intention to emphasize a point or write on the screen. However, in the case of the annotation feature, the pointing gesture is not the only primary trigger. To enable the pointing gesture as the primary trigger, the speaker needs to verbally express that they want to use the annotation feature (e.g., “use annotation feature”), allowing the system to distinguish the intention behind the pointing gesture. For this reason, the annotation feature requires both the verbal and gestural modalities as primary. The video control, zooming, hyperlink, and object control features, on the other hand, use the pointing gesture to complement the intention by specifying the target object, so the corresponding tag is “auxiliary.” Two features use gestures other than pointing gestures: page navigation and zooming. Page navigation does not require any complementary information from pointing gestures; therefore, the verbal modality alone can be the primary modality. However, a swiping gesture is suggested as a guided trigger; therefore, the gestural modality can also be used as the primary modality. Here, the two modalities are used independently to express the intention of page navigation; that is, either modality can be used alone. With the zooming feature, the pointing gesture is already used as an auxiliary modality to specify the target position of zooming. In addition, the participants adopted a guided trigger for the zooming feature using hand pinch and stretch gestures. Because these gestures by themselves convey the speaker’s intention to zoom in or out, the gestural modality of the zooming feature is regarded as primary.
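To make the auxiliary role of pointing concrete, the following is a minimal sketch of how a verbal command could be paired with the pointing gesture that supplies its target. It is not the mechanism used in the study, where a human wizard made this association. The `PointingEvent` type, the normalized coordinates, and the 1.5-second window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PointingEvent:
    time_s: float  # time at which the pointing gesture was observed
    x: float       # normalized screen coordinates in [0, 1] (assumed convention)
    y: float


def resolve_target(speech_time_s: float,
                   pointing: Sequence[PointingEvent],
                   window_s: float = 1.5) -> Optional[PointingEvent]:
    """Return the pointing event closest in time to the utterance.

    Only events within window_s seconds of the utterance are considered;
    None is returned when no gesture accompanies the speech.
    """
    candidates = [p for p in pointing if abs(p.time_s - speech_time_s) <= window_s]
    if not candidates:
        return None
    return min(candidates, key=lambda p: abs(p.time_s - speech_time_s))
```

In this sketch the speech carries the intention (for example, “zoom in here”) and the gesture only fills in the spatial slot, which mirrors the primary and auxiliary tags described above. A command that needs a target but receives `None` could fall back to asking the presenter or to a default location.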
Natural vs. guided
In gesture-based systems proposed in previous studies [11, 12], presenters are required to perform specific guided gestures, mostly designed by system developers, to trigger system actions. This study explored what type of gestures and speech would be the most appropriate triggers for such applications, instead of designing guided gestures a priori. Our participants’ ideas about triggers were simple: “as natural as possible.” Presentations are real-time performances, where the main objective of the presenter is to deliver a message to the audience. If the presenter makes an unnecessary gesture or speech in the middle of the presentation, the interruption disturbs the message delivery. Our participants often naturally expressed, in advance, their intention to trigger a system action during a presentation; therefore, the most preferred interface would infer the intentions from these natural expressions. In terms of speech interaction, they wanted the system to understand natural phrases such as “Well, then let’s watch the video, here” instead of constraining presenters to use only guided triggers such as saying “play.”
However, participants did not want to eliminate all command-like guided triggers. One of the participants said in the FGI session, “However, it would also be strange to keep saying ‘let’s move to the next page’ again and again.” Although there are many possible sentences with the same intention, repeatedly saying naturally formed sentences can cause an even higher level of interruption than that caused by the continuous use of guided speech or gestures. Therefore, the mixed use of both natural and guided co-speech gesture was suggested to be the most suitable for such presentation applications.
There are several implicit factors to consider regarding natural verbal triggers. Natural speech utterances include sentences with multiple intentions. One participant brought this issue into the discussion. When presenters naturally express their intentions, they may use complex sentences to express multiple intentions. She gave an example: “let’s watch the video again after moving to the page with the video.” This requires more challenging natural language processing technologies. The dependent clause ‘after moving to the page with the video’ expresses the intention to perform page navigation. However, the target is specified by its properties, which are highly dependent on the context. Overall, this aspect is closely related to the domination of the verbal modality over the gestural modality in presentation control. Since the most natural way for presenters to express their intentions is the verbal modality, the verbal modality dominates in the design feature specification, and the gestural modality is mainly auxiliary, supplementing the spatial dimension of the message.
Usability of intermediate design features
Although the number of samples was small (n=5), we examined how the participants perceived the usability of the features (Table 4). First, the controllability of all features, except object control, received high scores. This means that these design features allow presenters to easily control the slide as desired. However, two of the participants commented that the delay in the system reaction during video control made them feel that they were not in control. Two other participants gave low scores (2 points) for the controllability of the hyperlink feature owing to the absence of control after entering a hyperlinked page. (To address this, we added a scrolling function for websites; we considered adding more features for controlling websites to be out of scope.) These two comments show the importance of the response time of the system and the coverage of the features.
Second, the pointer feature scores show how natural it is to use gestures for emphasis (naturalness, 4.4) and that presenters point to deliver information more clearly (efficiency of information delivery, 4.4). Notably, the efficiencies of information delivery and resource use for the video control feature are high (4.2 and 4.6, respectively). All the participants mentioned that they would use the video control feature even if only guided gestures were supported because there was no available alternative for controlling videos from a distance. The hyperlink feature, which also has no alternatives, received slightly lower scores because two participants focused on the absence of further control on the linked page.
Finally, it is also remarkable that while the first three scores for the object control feature were quite low, the participants suggested leaving it in the final design list. Since presentations usually made by the participants are often academic and instructive with adult audiences, they would hardly use the object control feature, yet there can be other types of presentations, such as those used for teaching children, in which this feature could be helpful; therefore, the feature was retained in the final list for Phase 3.
The final design features were tested in the third phase. By implementing the final design as a WoZ system, we could see how usable the final product would be and how feasible it would be to implement the final design with the currently available technology, even before the development of full-scale actual technology.
We implemented the WoZ of the final design product and tested the usability of the features in terms of their naturalness, controllability, efficiency of information delivery, and efficiency of resource use, as in Phase 2. We also collected data on the recorded presentation for system development because they could reflect the user behaviors expected in the actual implemented system.
All five participants from Phase 2 joined the final test, and we recruited seven additional subjects (six males and one female, all graduate students) to enlarge the sample size (n=12). Each subject made a five-minute presentation in the WoZ setting, as in Phase 2, and listened to the presentations made by the other subjects. After the presentations, the subjects completed a survey measuring the naturalness, controllability, efficiency of information delivery, and efficiency of resource use. As in the first and second phases, the topic of the presentations was not pre-assigned, as long as the presenters shared their knowledge with the audience. After the usability test, we conducted semi-structured 1:1 interviews with the participants who attended all three workshops to complement the findings from the quantitative survey and collect feedback on the overall design process.
The results of the usability tests are listed in Table 4. Overall, we observed an improvement in all four categories relative to the usability scores of the intermediate design features. Specifically, the naturalness and controllability of the features improved markedly during this phase. This improvement comes from allowing natural triggers. According to our participants, presenters commonly express their intention to play a video, zoom in, or open the linked site because it is a sudden event that the audience cannot expect in advance. Therefore, the naturalness and controllability of these features were improved. However, a less sudden event such as page navigation did not improve much in terms of naturalness and controllability. The presenters rarely expressed their intention to change pages forward or backward using a natural trigger and, in most cases, used guided gestures instead. The participants mentioned that natural speech for changing slides was used only to fill the gap during the transition of pages. In other cases, saying something only for the transition disturbs the presentation, leading the presenters to use gestures instead. The participants also emphasized the advantage of using gestures for page navigation: the presenters can keep explaining while navigating the pages. However, natural triggers seem effective for the page navigation feature when the presenters randomly access pages, that is, when they switch between pages far from each other. Random accesses, such as revisiting a page shown long before, were sudden events, making the presenters naturally express their intention. Because of the addition of natural triggers, the presenters did not need to express any extra guided triggers, causing fewer interruptions and increasing the efficiency of information delivery and resource use.
It is also notable that the scores on the efficiency of information delivery and resource use for the zooming feature were largely improved, partly because of the additional guided triggers added in the second design workshop. As described earlier, more guided triggers, apart from natural triggers, were added for the zooming feature: pinch and stretch gestures. Almost half of the cases using the zooming feature in their presentations used these guided gestural triggers instead of natural speech-driven triggers. One participant commented, “The pointing gesture and zooming gesture were not that unnatural because they are quite intuitive gestures.” It appears that the common past experiences and intuitiveness of such gestures allow the speaker and audience to accept guided gestures. Moreover, the participants often felt more comfortable using carefully designed guided gestures than natural or guided speech because they could interact with the slide while verbally delivering their messages.
Finally, the results showed considerably lower scores for the object control and annotation features than for the others in all four measures. We speculate that these two features were non-critical functionalities for the type of presentation that the participants made in this research context (e.g., academic and instructive presentations for adult audiences). None of the presenters used the object control feature, whereas the others used the annotation features. Because these features dynamically change the contents of the slide, the need for these features comes from situations where it is critical to have active interactions between the content and the audience.
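The changes discussed in these results can be recomputed from the mean scores in Table 4. The sketch below assumes that the first value in each pair of table cells is the intermediate (Phase 2) score and the second is the final (Phase 3) score, which is consistent with the values quoted in the text but is an interpretation of the table layout. It reports differences of means only; the two phases had different sample sizes (n=5 and n=12), so these differences are descriptive rather than a statistical test.

```python
MEASURES = ["naturalness", "controllability", "information delivery", "resource use"]

# Mean scores transcribed from Table 4, one (first, second) pair per measure.
# Reading the pairs as (Phase 2, Phase 3) is an assumption.
SCORES = {
    "Page navigation": [(4.1, 4.3), (4.2, 4.3), (3.4, 4.6), (3.7, 4.4)],
    "Video control":   [(3.0, 4.4), (3.2, 4.3), (4.2, 4.2), (4.6, 4.6)],
    "Pointer":         [(4.4, 4.3), (3.8, 4.3), (4.4, 4.4), (4.4, 4.3)],
    "Zooming":         [(3.6, 4.3), (4.0, 4.4), (3.8, 4.6), (3.2, 4.6)],
    "Hyperlink":       [(3.4, 4.7), (3.4, 4.3), (3.4, 4.1), (4.0, 4.1)],
    "Object control":  [(2.0, 3.7), (2.4, 3.9), (2.4, 3.6), (3.8, 3.6)],
}


def score_changes(scores):
    """Return the change in mean score per feature and measure, rounded to one decimal."""
    return {
        feature: {m: round(after - before, 1) for m, (before, after) in zip(MEASURES, pairs)}
        for feature, pairs in scores.items()
    }


if __name__ == "__main__":
    for feature, deltas in score_changes(SCORES).items():
        print(feature, deltas)
```

Under this reading, page navigation changes little in naturalness and controllability, while video control, hyperlink, and object control gain more than a point in naturalness, in line with the discussion above. The annotation feature is left out because it has no earlier score to compare against.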
This section further discusses some implications of the study, such as the use of the designed features with other existing tools, and the benefits of using WoZ in the design process.
Mixed Use with Other Tools
An important observation from individual interviews is that the proposed design is not always the optimal solution for interacting with slides during a presentation. A remote controller is suitable for linear page navigation because it lets the presenter navigate linearly in a way that appears seamless to the audience. Compared with the guided gestural trigger for linear page navigation, the remote controller also allows the presenter to change the slide while explaining with less interruption owing to the absence of visible gestures. A laser pointer can provide the same effect as the pointer feature in a simpler manner.
However, the biggest limitation of these tools is the need for extra devices; to use these functionalities, presenters should always bring these tools with them. When presenters do not have the tools with them, the proposed features can be used as alternatives. The linearity of the remote controller is another limitation, which can be overcome by the page navigation feature. Finally, the pointer feature improves visibility, which is a limitation of the laser pointer: the laser pointer is not visible when the distance between the screen and audience is large.
Because the participants prioritized causing the least interruption for a better presentation, they suggested the mixed use of the proposed design features with the currently available tools, as the two can complement each other. While a remote controller provides seamless linear navigation, the proposed design features provide the ability to access random pages, control videos, enter hyperlinks, etc. The choices among the tools and features were made in real time based on the expected interruption level of the possible features.
Benefits of WoZ
During the individual interviews, the participants agreed that WoZ was helpful in the design process. In this study, the utilization of WoZ in the PD framework resulted in two main benefits: one to the design process and one to data collection. The former means that WoZ helps participants make their ideas more concrete, and the latter means that WoZ can reduce the cost of data collection by integrating the data collection process into the design process.
Benefits to design process
In this study, the integration of WoZ and PD made the refinement process meaningful. The target domain of our study, co-speech gesture interaction, is an exemplary case in which implementing the system is highly laborious, costly, time-consuming, and technically challenging. Hence, it is not feasible to implement all the design features suggested by participants in the PD process. Instead, WoZ can function as a simulation of a real system. In this study, we observed that the utilization of WoZ was helpful in generating design ideas and deepening the details of each feature in the design process.
Benefits to data collection
Apart from the benefits to the design process, WoZ also provides benefits in terms of data collection. In this study, the implementation of co-speech gesture interactions requires several challenging technologies, such as co-speech gesture recognition and dialog management for planning system actions. However, collecting a realistic dataset of natural co-speech gestures is complex. It involves more than simply recording natural co-speech gestures in a presentation setting, because it is known that human beings use language differently depending on whether they interact with machines or humans. To manage this difference, WoZ has been utilized to collect realistic natural utterances for human-machine interactions in language processing fields [28, 30–32, 40]. Anyone who has a design for intelligent interactions and wants to implement it will probably need a WoZ system to collect a realistic dataset for training a machine-learning model.
In this study, we integrated WoZ into the design process and used it as part of that process. As a result, we did not need a separate data-collection process. Instead, we collected data during the design process by recording the participants’ trials using Microsoft Kinect v2. The resulting dataset contains a total of 17 video recordings, totaling about two and a half hours and occupying 1.5 TB. The dataset is multimodal, including 1920×1080 uncompressed color videos (30 Hz), 512×424 uncompressed infrared videos (30 Hz), and audio streams from a microphone array with four microphones.
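As a rough plausibility check on the reported dataset size, the storage needed for the two uncompressed video streams can be estimated from the resolutions, frame rates, and approximate duration given above. The sketch below is illustrative only: the bytes-per-pixel values are assumptions that the study does not state (2 bytes per pixel for both streams by default, with 4 bytes per pixel for color as an alternative), the function names are hypothetical, and audio and container overhead are ignored.

```python
# Back-of-envelope storage estimate for the recorded streams.
# Assumptions (not stated in the study): bytes per pixel for each stream,
# and that every recording contains both streams for its full duration.

def stream_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Raw data rate of one uncompressed video stream, in bytes per second."""
    return width * height * bytes_per_pixel * fps


def estimate_dataset_bytes(duration_s, color_bpp=2, infrared_bpp=2):
    """Total bytes for the color and infrared streams over duration_s seconds."""
    color = stream_bytes_per_second(1920, 1080, color_bpp, 30)
    infrared = stream_bytes_per_second(512, 424, infrared_bpp, 30)
    return (color + infrared) * duration_s


if __name__ == "__main__":
    duration_s = int(2.5 * 3600)  # "about two and a half hours"
    for color_bpp in (2, 4):
        total = estimate_dataset_bytes(duration_s, color_bpp=color_bpp)
        print(f"color at {color_bpp} B/pixel: {total / 1e12:.2f} TB")
```

Under these assumptions the estimate is roughly 1.2 TB with 2 bytes per color pixel and roughly 2.4 TB with 4 bytes per color pixel, so the reported 1.5 TB is of the expected order of magnitude for uncompressed recordings of this length.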
In this study, we conducted an iterative PD process with WoZ to design a new slideshow presentation tool with co-speech gesture interactions. The overall process was divided into three phases. In the first phase, the participants brainstormed new features for a slideshow presentation tool and created an initial design set. Among the diverse features in the initial design set, we confirmed that speech and gestures would be the most suitable form of interaction between the presenter and slide. In the second phase, focusing on the scope of co-speech gesture interactions, the participants evaluated the initial design in a WoZ setting, which allowed us to arrive at a refined final design. The refined final design set reflects the need for both natural and guided triggers for each feature. We also observed that the verbal modality was more dominant, whereas many previous studies focused on creating gesture-based systems. Finally, the usability of the final design set was evaluated, and it was demonstrated that the proposed design features were usable in terms of naturalness, controllability, efficiency of information delivery, and efficiency of resource consumption.
In this study, we obtained a set of design features for a new slideshow presentation tool using co-speech gestures and could unpack the nature of presenters’ co-speech gestures as an interaction medium. Nevertheless, this study has several limitations. Statistical generalization is difficult because the number of participants was small and the sample consisted only of engineering-related majors. In addition, the findings of this study are limited to slideshow-type presentations. Only some elements of usability were evaluated in this study; if the number of gesture types increases, memorability and learnability should also be evaluated as important aspects. It will also be necessary to evaluate the interactions at the user experience level, including affect.
We present the possible directions for future research. Because our application targets slideshow presentation support, we restricted the scope of the functionalities to on-slide functionalities. The only exception was the scrolling functionality on a website opened through a hyperlink on a slide. The participants suggested some features beyond this restriction, such as complex webpage control and annotation functionalities. In future research, to extend the application to an interactive smartboard through co-speech gestures, one might need in-depth research on the nature of co-speech gesture usage for these features.
We plan to work toward developing the technologies required to implement the design. Some of these technologies are still immature, such as multimodal speech and gesture understanding and target-pointed on-screen object calibration. Using the dataset collected through WoZ in this study, we plan to improve these technologies and develop next-generation presentation tools for seamless and natural interaction. In particular, with the advent of tools such as Google ML Kit and Huawei ML Kit, gesture recognition and automatic speech recognition algorithms have become lighter, enabling high-accuracy and high-speed recognition even on mobile phones. By taking full advantage of these advances, we plan to develop a tool that supports hands-free presentation using only a mobile phone’s camera sensor and microphone.
Conceptualization, KYS. Funding acquisition, KP. Investigation and methodology, KYS. Project administration, KYS. Resources, KYS, JHL. Supervision, KP. Writing of the original draft, KYS. Writing of the review and editing, KYS, KP. Software, KYS, JHL. Validation, JHL, KP. Formal analysis, KYS, KP. Data curation, KYS. Visualization, KP.
This study was supported by the Translational R&D Program on Smart Rehabilitation Exercises (TRSRE-Eq01), National Rehabilitation Center, Ministry of Health and Welfare, Korea. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2021R1G1A1012063). The present research has been conducted by the Research Grant of Kwangwoon University in 2020.
The authors declare that they have no competing interests.
Name: Ki-Young Shin
Affiliation: Pohang University of Science and Technology
Biography: Ki-Young Shin is currently working at Designovel (utilizing generative AI technology with a focus on fashion and related industries) as a cofounder and pursuing a Ph.D. degree in Convergence IT Engineering at Pohang University of Science and Technology, Pohang, South Korea. He received the B.B.A. degree in 2011 from Hanyang University, Seoul, South Korea, and started his career as a marketer at Samsung Mobile in the same year. He also worked as a consultant at iX&M (Interactive Experience & Mobile), IBM GBS, from 2015 to 2016. His research interests include intelligent interaction, natural language processing, and artificial intelligence.
Name: Jong-Hyeok Lee
Affiliation: Pohang University of Science and Technology
Biography: Jong-Hyeok Lee received the B.S. degree in mathematics education from Seoul National University, Seoul, South Korea, in 1980 and received the M.S. and Ph.D. degrees in computer science from KAIST, Daejeon, South Korea, in 1982 and 1988, respectively. He is currently working as a professor with the Graduate School of Artificial Intelligence at POSTECH, Pohang, South Korea. His research interests include natural language processing, machine translation, information retrieval, and artificial intelligence.
Name: Kyudong Park
Affiliation: Kwangwoon University
Biography: Kyudong Park is currently working as an Assistant Professor in the School of Information Convergence at Kwangwoon University, Seoul, South Korea. He received the B.S. degree in Computer Science and Engineering from Kyungpook National University, Daegu, South Korea, in 2012, and the Ph.D. degree in Convergence IT Engineering from Pohang University of Science and Technology, Pohang, South Korea, in 2019. His research interests include intelligent interaction, human-centered AI, and online user behavior analysis.
Ki-Young Shin, Jong-Hyeok Lee, and Kyudong Park, Hands-Free Presentation Tool with Co-speech Gesture Interactions: A Wizard-of-Oz Study, Article number: 13:18 (2023)