
Research Schemes

Multimedia Information Retrieval:

With the development of multimedia and network technologies, a large amount of multimedia content (e.g. recordings of TV or radio broadcasts, presentations, meetings or lectures) is readily available in the growing global information infrastructure. This has created an urgent demand for automatic multimedia indexing, retrieval, visualization, organization and management technologies. It is important for information providers to offer a personalized aggregation of relevant multimedia content dynamically in response to any user request. Hence we aim to develop technologies that can automate the following processes:

a. Multimedia fission:

This procedure identifies the basic constituents of media content (e.g. shots in a video file, stories or topics in an audio/video file, a textual paragraph, a graphic) as well as their groupings, i.e. higher-level constituents such as a textual story together with its illustrative graphic.

b. Multimedia categorization:

This procedure classifies the identified constituents (from the previous step) according to appropriate semantic categories.

c. Multimedia fusion:

Given a user's request, multimedia fusion combines the relevant multimedia constituents into a form optimized for usability when displayed to the user.
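As one illustration of the fission step, shot boundaries in a video are often found by comparing the grey-level histograms of consecutive frames. The sketch below is only a minimal example of that general idea, not the method used in this research: it assumes each frame is a flat list of grey values in the range 0 to 255, and the names `histogram` and `shot_boundaries`, the number of bins and the threshold are all hypothetical choices made for illustration.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalised grey-level histogram of a frame given as a flat list of ints."""
    counts = [0] * bins
    width = max_value / bins
    for pixel in frame:
        counts[min(int(pixel / width), bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]


def shot_boundaries(frames, threshold=0.5):
    """Return the indices i at which frame i is judged to start a new shot.

    A boundary is declared when the L1 distance between the histograms of
    frames i-1 and i exceeds `threshold` (an illustrative value only).
    """
    boundaries = []
    previous = None
    for i, frame in enumerate(frames):
        current = histogram(frame)
        if previous is not None:
            distance = sum(abs(a - b) for a, b in zip(previous, current))
            if distance > threshold:
                boundaries.append(i)
        previous = current
    return boundaries


# Toy video: three dark frames followed by two bright frames.
dark = [10, 20, 30, 15]
bright = [200, 220, 240, 210]
video = [dark, dark, dark, bright, bright]
print(shot_boundaries(video))
```

On real video the threshold would have to be tuned, and gradual transitions such as fades generally need more than a frame-to-frame comparison.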

Speech Recognition:

Our research on speech recognition ranges from template-based, small-vocabulary isolated word recognition to Hidden Markov Model (HMM) based large-vocabulary continuous speech recognition. Our hardware implementation methods for small-vocabulary speech recognition have been granted three patents: one national invention patent, 'speech control device and method', and two utility model patents, 'voice control device' and 'speech controlled toy circuit'. A large-vocabulary continuous speech recognition system and a lip reading system have also been built.
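Template-based isolated word recognition is commonly implemented with dynamic time warping (DTW), which compares an utterance with stored templates while tolerating differences in speaking rate. The text above does not say which matching method was used, so the following is only a sketch of the general technique. It assumes one scalar feature per frame for brevity (real systems typically use a feature vector per frame), and the templates and the names `dtw_distance` and `recognize` are made up for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the template closest to the utterance under DTW."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Hypothetical one-dimensional templates for a two-word vocabulary.
templates = {
    "yes": [1, 3, 5, 3, 1],
    "no": [5, 4, 3, 2, 1],
}
utterance = [1, 1, 3, 5, 5, 3, 1]  # the "yes" contour, spoken more slowly
print(recognize(utterance, templates))
```

The search is exhaustive over the vocabulary, which is practical only for small vocabularies; this is one reason large-vocabulary systems rely on statistical models such as HMMs instead.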

Speaker Recognition:

Speaker recognition (SR), or voiceprint recognition, determines a person's identity from the characteristics of his or her voice. For access control, the wide availability of telephones makes SR the most convenient approach to remote access control. It can also be used to retrieve a particular person's speech or to mark and index audio streams, for example in telephone surveillance and meeting records. The most common features used in current SR systems are based on low-level temporal-spectral information (or a simple function of it, such as the mel cepstrum), which is a fragile information carrier that is easily distorted by channel effects, noise, and even small amounts of room reverberation. We therefore need to find high-level features, such as idiosyncratic word usage and pronunciation, prosodic patterns, and vocal gestures. In audio diarization applications, following speaker changes and segmenting the utterances of different speakers without prior knowledge are also important issues.

Audio Visual Speech Processing:

Audio visual speech processing is a new research field at the intersection of speech processing, image processing and computer vision, aiming to fuse the audio and visual information of human speech. Research topics include audio visual speech recognition and speech unit segmentation, audio visual emotion recognition and expression, audio visual speaker recognition, as well as speech-driven or text-driven (emotional) talking head animation.

Audio Signal Processing:

Audio signal processing covers techniques for processing digital audio data captured on a computer. We focus on digital sound effects, virtual sound, audio watermarking, speech signal enhancement, and sound localization and tracking. Digital sound effects simulate the sound effects used in multimedia applications, such as reverberation, echo, pitch and speed modification, and equalization. Virtual sound reproduces the true position of a sound source using fewer channels, based on the binaural localization ability of humans. Audio watermarking is an information hiding technology that embeds imperceptible information into an audio signal. Speech enhancement is a classical topic in which many problems remain open. Sound localization and tracking finds the position of a sound source using a microphone array. We are currently focusing on indoor speaker localization and tracking.
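Sound localization with a microphone array is often built on the time difference of arrival (TDOA) between microphones. As a rough sketch of that general idea, and not of the approach actually used here, the example below estimates the delay between two microphone signals by brute-force cross-correlation and converts it to a far-field arrival angle. The function names, the 16 kHz sample rate, the 0.2 m microphone spacing and the toy signals are all assumptions made for illustration.

```python
import math


def estimate_delay(reference, delayed, max_lag):
    """Return the lag in samples that maximises the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(delayed):
                score += x * delayed[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(delay_samples, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Far-field arrival angle in degrees, measured from the broadside direction."""
    tau = delay_samples / sample_rate               # delay in seconds
    ratio = speed_of_sound * tau / mic_spacing      # equals sin(angle)
    ratio = max(-1.0, min(1.0, ratio))              # guard against rounding
    return math.degrees(math.asin(ratio))


# Toy signals: the same pulse reaches microphone 2 three samples later.
mic1 = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic2 = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]

lag = estimate_delay(mic1, mic2, max_lag=5)
angle = arrival_angle(lag, sample_rate=16000, mic_spacing=0.2)
print(lag, round(angle, 1))
```

Indoors, reverberation produces spurious correlation peaks, so a plain cross-correlation like this one is usually only a starting point.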