Big Data—A Big Deal
Tired of scrolling through your cable provider's seemingly endless television listings to find a program you want to watch? Well, my friend, those days are numbered. Just as Amazon and Google offer suggestions for products or websites based on your online history, soon your TV will discern your viewing preferences based on the programs you watch. How is this done? Through Big Data, a process where multiple computers work simultaneously to mine and make sense of vast amounts of data (we're talking terabytes and even zettabytes here). SCU's computer engineering assistant professor Yi Fang is helping to bring that technology to fruition, supported by $500,000 in recent funding from Santa Clara–based TCL Research (TCL is the world's third-largest television and fifth-largest mobile phone manufacturer).
"The next generation of TV will connect with the Internet," said Fang, "and it will know what has been viewed, will analyze the content of those programs to determine the users' preferences and taste, and will then make recommendations." Beyond simply helping us choose our shows, tomorrow's TV will offer information about the products, locations, and people featured in the video—where you can buy that shirt, what derailleur is on that bicycle, or whatever topics are trending on social media.
"There are many technical challenges," Fang continued. "The sheer volume of videos that are available demands that we have a way to quickly process images and dialog to recognize all those products, styles, locations, people, subjects…. We need the computer to understand what the program is all about and understand what you are most interested in. Our work is to try to pinpoint the exact information the user wants and provide it seamlessly."
Additionally, with funding from the School of Engineering, Fang recently helped install a Hadoop cluster of 24 computer nodes and 250 TB of data storage in the SCU Design Center. Data that is too large to be housed on a single machine is divided up and sent to the different computers, each of which processes its own portion, after which the results are aggregated. The cluster is available to any engineering faculty member or Senior Design student team that needs to process a huge amount of data. Fang is also using it to analyze more than one billion webpages, building an index that maps each word on every page, in every language, to help advance the search process. That's Big Data!
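The divide, process, and aggregate pattern described above can be illustrated with a small sketch. This is not Fang's code and does not use the Hadoop API; it is a toy, single-process Python illustration in which every function name and the sample pages are hypothetical. It splits a handful of "webpages" into portions, builds a partial word-to-page index for each portion (the work one node would do), and then merges the partial results.

```python
from collections import defaultdict

# Hypothetical sample data: (page_id, text) pairs standing in for webpages.
PAGES = [
    ("p1", "big data is a big deal"),
    ("p2", "search the web"),
    ("p3", "web data and search"),
    ("p4", "big web"),
]


def split_into_chunks(pages, n_workers):
    """Divide the pages into one portion per worker (round-robin)."""
    return [pages[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk):
    """Work done by a single node: index only its own portion."""
    partial = defaultdict(set)
    for page_id, text in chunk:
        for word in text.lower().split():
            partial[word].add(page_id)
    return partial


def aggregate(partials):
    """Merge the partial indexes from all nodes into one index."""
    index = defaultdict(set)
    for partial in partials:
        for word, page_ids in partial.items():
            index[word] |= page_ids
    return {word: sorted(ids) for word, ids in index.items()}


def build_index(pages, n_workers=3):
    chunks = split_into_chunks(pages, n_workers)
    # In this sketch the portions are processed one after another;
    # on a real cluster each portion would go to a separate machine.
    partials = [map_chunk(chunk) for chunk in chunks]
    return aggregate(partials)


if __name__ == "__main__":
    index = build_index(PAGES)
    print(index["web"])  # pages that contain the word "web"
```

Because each portion is indexed independently and the merge is a simple union, the final index does not depend on how many workers the pages are split across, which is what lets this kind of job scale out to many machines.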
Fang's team also works with engineers and researchers from other Silicon Valley companies, such as Google, Microsoft, Yahoo!, Apple, and SimplyHired. "Through these collaborations, I am able to access real-world data to address challenging problems in Big Data, and my students have the opportunity to engage deeply with Silicon Valley technology giants," he said. "It's a real advantage to be in this location. We cannot just stay in our school; we need to talk with people in the field and have access to their resources. Companies we work with, and visit frequently, have thousands of machines to process huge amounts of data. There's no way a university can have so many machines and so much real-world data, so it is very helpful to be so close to those who do."
To keep up with the need for engineers trained in this field, a new Big Data track within the graduate engineering master's program has been established. Comprising three new courses (COEN 240 Machine Learning, COEN 241 Cloud Computing, and COEN 242 Big Data), the series takes students from theory to infrastructure to application. Fang has also developed a number of undergraduate courses addressing web information management, web search and information retrieval, and web technologies.
"This is an exciting new path for the School of Engineering," he said. "This field is interdisciplinary—not just for computer engineering—because data is generated very quickly in every discipline. We need computational tools to process, analyze, and understand data. I think there are many opportunities to help other disciplines solve their own challenges in managing and understanding data and putting it to use."