Robots subverting humans is the theme of many science fiction stories, but scientists who want to get things done here and now are still focusing on the current challenges and bottlenecks of artificial intelligence, which reflects the wisdom of human beings.
At present, the outburst of data in the world is almost out of control, and we need a radical innovation to classify and calculate the data. Fundamentally speaking, human beings have not yet fully adapted to the data-based life, as when a human body has not adapted to the rhythm of the assembly line.
Contradiction reigns. We can find some similarities between today’s contradictions and those of the industrial era.
The relationship between the flying shuttle and spinning jenny is illustrative. In 1733, John Kay invented the flying shuttle. With the prior shuttle, two weavers were needed for large looms. But one weaver could operate the flying shuttle, thus significantly increasing weaving speed. A problem soon emerged: weaving needs cotton yarn, but the speed of spinning the yarn could not keep up with the weaving demand. The need could only be addressed with more spinning machines and spindles. In 1764, James Hargreaves invented the spinning jenny (with jenny being slang for an engine), which doubled the spinning efficiency; the speed of spinning finally caught up with the speed at which the flying shuttle consumed raw material. Several years after the spinning jenny, the spinning mule (a machine to spin cotton) and the self-acting (automatic) mule were invented. By this time, the speed of shuttle weaving was no longer fast enough, which spurred the invention of the power loom. The two sides, spinning and weaving, inspired each other. At about the same time, the Watt steam engine came into the world, steam power was awakened, and the spinning and weaving sectors rushed to incorporate it. By that point, the industrial revolution was unfolding through countless mechanical advances.
The relationship between artificial intelligence and data is similar to that between the flying shuttle and spinning jenny. In the past, people conceived a method of machine learning but lacked a sufficient amount of data to verify the learning and practice it. The Internet explosion finally made data available, but dealing with explosive growth of data tests hardware capabilities and computing power.
The brave attempts of adventurers contributed to the success of big Internet companies such as BAT. These giants share a deep understanding of how to deal with massive data.
During the early stage, Alibaba used the Oracle Database system for data storage. This database architecture of the Internet 1.0 era quickly failed to handle the explosive growth of e-commerce data. Alibaba had to reinvent its infrastructure, building and using its own database.
Before the start of 2013, Jingdong often suffered from server crashes due to the surge in visits. It had to update the back-end architecture and replace .NET technology with Java technology.
Chinese people’s deepest anger over data may be the ticketing disaster of the 12306 website a few years ago. Going home for the Spring Festival is a tradition in Chinese people’s blood. But being such a populous country, China suffers from a digital disaster every year. For the train lines in the physical world, this stresses transportation. Everyone is painfully squeezed into the passenger car without dignity. This situation has been gradually eased by high-speed railways. But the same congestion has shifted to the network. In order to facilitate ticket purchasing, the Ministry of Railways upgraded the informatization of the ticket-purchasing system and launched the 12306 website. However, at that time, nobody expected the data challenge brought by the Internet. The site was intended to facilitate ticket purchasing, but it ended up causing inconvenience. Hundreds of millions of people searching for and buying tickets at the same time quickly made the server crash. People blamed the programmers for incompetence and claimed that replacing them with e-commerce engineers could solve the problem. But the real key factor was that processing ability couldn’t keep up with data development. Some specifically compared e-commerce websites with the 12306 website. In a Double Eleven (November 11th) sales promotion, Taobao and other e-commerce websites also took orders from a large number of people, and the companies distributed a great deal of goods with few problems. But the comparison is invalid: tens of thousands or even hundreds of thousands of people rushed to buy little more than a thousand seats for each departure. With every potential purchase, the ticketing system not only analyzed the data of all the stations of the line but also counted, dozens of times, the number of tickets for the line and updated the number of available tickets of all the stations in real time. One ticket influenced the entire line.
The amount of data and calculations grew geometrically, and everything had to be done instantly, which is difficult to solve even with more servers, regardless of cost. Such a problem did not exist for large-scale e-commerce, and it was only alleviated after the new computing architecture and methods were explored.
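Why one ticket influences the entire line can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the 12306 system: the station names, the seat count, and the simplification of tracking free seats per segment (rather than assigning individual seats) are all hypothetical.

```python
from itertools import combinations


class LineInventory:
    """Toy model of one departure: free seats are tracked per track segment."""

    def __init__(self, stations, seats):
        self.stations = list(stations)
        # One counter per segment between adjacent stations.
        self.free = [seats] * (len(self.stations) - 1)

    def _span(self, origin, dest):
        i, j = self.stations.index(origin), self.stations.index(dest)
        if i >= j:
            raise ValueError("destination must come after origin")
        return i, j

    def available(self, origin, dest):
        # A trip needs a free seat on every segment it crosses.
        i, j = self._span(origin, dest)
        return min(self.free[i:j])

    def sell(self, origin, dest):
        i, j = self._span(origin, dest)
        if min(self.free[i:j]) == 0:
            return False
        for k in range(i, j):
            self.free[k] -= 1
        return True

    def snapshot(self):
        # Availability for every origin-destination pair on the line.
        return {(a, b): self.available(a, b)
                for a, b in combinations(self.stations, 2)}


line = LineInventory(["A", "B", "C", "D", "E"], seats=2)
before = line.snapshot()
line.sell("B", "D")
after = line.snapshot()
changed = [pair for pair in before if before[pair] != after[pair]]
print(len(changed), "of", len(before), "station pairs changed")
```

A single sale between two middle stations changes the availability of most station pairs on the line, whereas an e-commerce order usually touches only the stock counter of one product.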
Baidu was the first company among BAT to face the impact of big data. Netizens from all over the country send massive amounts of search data to the Baidu servers. The network information that grows day and night also exhausts Baidu’s content crawlers. Baidu uses presearching and relevant-word searching to alleviate the transient data impact on the server. In the presearching mode, the system automatically runs searches and stores the results when the number of search requests is low (such as in the early morning). When the user sends a search request, the system returns the finished result without searching all over again. Relevant-word recommendation uses the system’s relatively idle time and clear functional structure to analyze user behavior data. For example, when the user enters “TPP” (Trans-Pacific Partnership Agreement) in the search input box, a drop-down menu will automatically pop up to provide search options, such as: “TPP means,” “TPP’s impact on China,” “TPP12 members,” “TPP protocol,” etc. Of course, the system will also automatically guess that a few users want to enter the phonetic abbreviation of “Tao Piao Piao” (a movie-ticket buying app), which will also be listed in a nonpriority position for users to choose. The arrangement of these options is understandable and meets the basic needs of most people.
At the bottom of the search results page, Baidu also provides a related word search.
In addition, the search engine lists the most frequently searched news related to TPP according to search popularity, which makes it convenient for users to obtain information.
The suggestions are all made by using the statistics of a large number of user searches, which helps to optimize the searching experience, boost search speed, and ease data-processing pressure.
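The idea of ranking suggestions by search statistics can be sketched as follows. This is only an illustration: the query log and the function name `build_suggester` are hypothetical, and a production search engine would use far more elaborate indexes and signals than a simple prefix match over counted queries.

```python
from collections import Counter


def build_suggester(query_log):
    """Count past queries once (for example, during idle hours)."""
    counts = Counter(q.strip().lower() for q in query_log)

    def suggest(prefix, limit=4):
        # Return the most frequent past queries that start with the prefix.
        prefix = prefix.strip().lower()
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


# Hypothetical log of earlier searches.
log = (["TPP means"] * 5 + ["TPP's impact on China"] * 3
       + ["TPP protocol"] * 2 + ["weather today"] * 4)
suggest = build_suggester(log)
print(suggest("TPP"))
```

Because the counting is done ahead of time, answering a request at peak hours is a lookup rather than a fresh computation.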
Data can be a source of an infinite variety of fantastic problems. Data is not only composed of homogeneous bits but also related to different kinds of special human-activity scenarios, which puts data processing in a challenging position. But fundamentally, the problem is still the same contradiction between spinning jenny and the flying shuttle—all the progress of the hardware will be immediately consumed by the amount of calculation and data. Although hardware capabilities are growing very rapidly, doubling every eighteen to twenty-four months at the same cost (also known as Moore’s Law12), data is growing much faster than hardware. Why?
The Malthusian Trap of the Data Century
The Malthusian theory of population is well known; grain production grows arithmetically, while the population grows geometrically. If there is no big breakthrough in output, then the grain-based production materials can only rely on land expansion, while family population growth is exponential in the absence of birth control. As a result, the population quickly reaches the ceiling. After food crises, people were caught in wars, famine, diseases, and other disasters, and the population was greatly reduced. The industrial revolution, agricultural science, progress in technology, and population management have alleviated the Malthusian trap. Today, a similar catastrophe has appeared in the virtual world.
Thomas Malthus’s law for the big-data world can be described as follows:
• The population grows arithmetically, while the data grows geometrically.
• The amount of data increases linearly, and the amount of calculation grows nonlinearly.
The population of developed countries has increased slowly; some countries have even experienced negative growth. But the data generated by the world is always growing at a fast speed. This is because data is generated from all people and all human activities. As long as we want to record, countless data can be generated “out of nothing.”
In the early days, most e-commerce websites only pursued markets and users, emphasizing operations rather than data or data layout. For example, if the user closed an order without purchasing, e-commerce systems did not record this behavior and simply deleted it. Later, however, it was noticed that recording and analyzing the user’s failed transaction data is also of value and can be used to summarize user credit, preferences, etc., so the data was recorded. Every behavior was recorded, and the amount of data began to multiply.
The storage of data has always been a big problem. According to Forrester Research, a well-known market-research firm, one smartphone can generate an average of 1 GB of data per day. The number of global smartphone users is conservatively estimated at more than two billion, generating more than 2 billion gigabytes (about 2 exabytes) of data per day. If we wanted to store this amount of data on ordinary 1 TB hard disks (each with a capacity of 1,024 GB), we would need about two million hard disks every day and more than seven hundred million every year, which far exceeds the global output.
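The storage estimate above is simple arithmetic and can be checked directly. The inputs below are the figures quoted in the text (1 GB per phone per day, two billion phones, 1,024 GB per disk); the variable names are only for illustration.

```python
GB_PER_PHONE_PER_DAY = 1          # figure quoted in the text
SMARTPHONES = 2_000_000_000       # conservative estimate quoted in the text
DISK_CAPACITY_GB = 1024           # one "1 TB" disk, as defined in the text

daily_gb = GB_PER_PHONE_PER_DAY * SMARTPHONES
disks_per_day = daily_gb // DISK_CAPACITY_GB
disks_per_year = disks_per_day * 365

print(f"{daily_gb:,} GB per day")
print(f"{disks_per_day:,} disks per day")
print(f"{disks_per_year:,} disks per year")
```

With these inputs the result is just under two million disks per day and roughly 713 million per year (about 730 million if the daily figure is rounded up to two million first).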
A more frightening fact is that most data is not generated from human activities. In 2014, Imperva Incapsula, a website security and content-distribution company, released a statistic: 56 percent of page views were contributed by crawler robots. In other words, the main Internet users are no longer human. Most of the click data is generated by machine programs.
Imperva Incapsula’s data comes from fifteen billion visits in ninety days from twenty thousand websites around the world with at least ten visits per day.
Nearly half of nonhuman access comes from benign robots, such as the content crawlers of search engines, which index web pages so that people can find the corresponding web content quickly. Baidu and Google use this method to organize information. However, more than half of the nonhuman page views come from malicious robots, such as pirate crawlers that steal content, various hacking tools, spam-sending tools, etc., and the proportion is still increasing. In a sense, this is a rich picture of the dark side of the Internet. This is only for web browsing. In the entire human society, with the rapid development of informatization and the Internet of things, all the hardware elements connected to the network are producing data and communicating with each other. The detection chip on the generator set detects the running status and sends the data back to the server. The cameras across the city upload the monitoring data to the command center, and the smart TVs, refrigerators, etc., in our homes are collecting and uploading data. Even if all humans are asleep, the world is still moving forward according to the rhythm of the data ocean.
The amount of data increases linearly, but the amount of calculation it requires increases nonlinearly. Data must be processed to realize its value, but as the amount of data increases, the amount of calculation increases at a faster rate. For example, the number of points on a Go board is only about five times the number of squares on a chessboard, but the calculation is hundreds of millions of times more complicated than for chess. For e-commerce and search engines, if the list of products or search results is sorted, the amount of calculation for this work rises along a steep curve as the number of items increases.
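The sorting case can be observed directly by counting comparisons. The sketch below uses Python's built-in sort on random numbers; the sizes and the seed are arbitrary choices for illustration, and the exact counts depend on the data and the sort implementation.

```python
import random
from functools import cmp_to_key


def count_comparisons(n, seed=0):
    """Sort n random numbers and return how many comparisons were made."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    sorted(data, key=cmp_to_key(compare))
    return calls


sizes = [100, 1_000, 10_000]
per_item = [count_comparisons(n) / n for n in sizes]
for n, cost in zip(sizes, per_item):
    print(f"{n:>6} items: about {cost:.1f} comparisons per item")
```

If the work were linear, the cost per item would stay flat. Instead it keeps rising as the list grows, which is the familiar n log n behavior of comparison sorting, and sorting is one of the milder cases of nonlinear growth.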
Overcome the Trap of Data
To cross the Malthusian trap of the data century, we need to do three things:
• Deal with a large amount of concurrent data in an efficient way
• Store data efficiently and delete unnecessary data
• Mine the accumulated data
For the first point, the most groundbreaking processing technology is parallel computing or distributed computing. A huge calculation task is split into small tasks, and each small task is assigned to one computer. After the computers finish their calculations separately, the results are combined to obtain the final product. Hadoop and Spark are representative technologies for this work. Hadoop’s MapReduce (a programming model for parallel computing of large data sets) can break a single task into segments and send them (map) to multiple nodes (or processes), followed by combining (reduce) the partial results and loading them into the database as a single data set. Hadoop dynamically moves data between nodes and balances the load across all tasks. This can greatly reduce task-queuing time and pool the efficiency of general servers without any help from expensive supercomputers. This technology first played a role in web searching, and it later quickly became a platform for distributed computing. Spark can be viewed as an optimization of Hadoop. The output of each subtask is kept in memory, so that files in storage do not have to be read repeatedly, which speeds up processing. This optimization still relies on deploying more hardware performance, with batch processing as the basic method.
Batch processing is a reminder of the old era, a transitional half-old-half-new world. Before the advent of the data explosion, people generally processed data by storing it first and analyzing it later. Efficient analysis of static data relies primarily on batch-processing commands. The data-flow era forces people to process data far more promptly. It is best to complete the analysis and judgment immediately when the data event occurs, and to deliver results as quickly as possible, as with population-flow data and grid operating-status data. Batch-processing technology was not designed to cope with the data-torrent era; it can only handle large volumes of data that have already been stored.
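The split, map, and reduce steps can be sketched with the Python standard library. This is not Hadoop's API: threads stand in for cluster nodes, and the function names `map_chunk`, `merge`, and `word_count` are hypothetical. The example counts words, the classic demonstration task for this model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: each worker counts the words in its own slice of the input."""
    return Counter(word for line in lines for word in line.lower().split())


def merge(total, partial):
    """Reduce step: fold one partial result into the running total."""
    total.update(partial)  # Counter.update adds counts together
    return total


def word_count(lines, workers=3):
    # Split the big task into one small task per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


text = ["data grows fast", "hardware grows slowly", "data beats hardware"]
print(word_count(text))
```

Each worker sees only its own chunk, so adding workers shortens the map step, while the reduce step is where the partial results must be communicated and merged.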
The emergence and surge of data torrents will inevitably call for a flow-computing approach, which is the latest challenge and direction of innovation in the data field. Flow calculation is not a specific technology, but a general term for attitudes and methods.
Not all application scenarios require real-time computing, but more and more businesses are becoming aware of the need for it. Ideally, when an event occurs, the resulting data should be processed immediately. For example, in mobile-phone billing, the charge must be calculated as soon as a call is completed instead of later; user portraits on news and e-commerce websites should also be updated in time so that the machine can be instructed to make recommendations, instead of analyzing the user’s behavior data after it has been stored. The same goes for many industrial units. But there are still many difficulties in achieving real-time computing, such as data aggregation, communication, and calculation.
Both Hadoop and Spark are precursors to flow computing, but they are far from enough. In 2014, Xing Bo, a professor at Carnegie Mellon University and chairman of the ICML 2014 program, pointed out that a large share of the resources of big-data processing platforms is wasted on cluster communication. Even on a better platform, computing takes only 20 percent of the time, while communication takes about 80 percent. For example, Hadoop’s communication time accounts for 90 percent.13
Flow computing drives the development of intelligence. After all, being smart is all about time. If we have infinite time and “constant dripping wears away a stone,” then it doesn’t matter whether a method is smart. Human beings want to find ways to quickly wear down stones with tools. This is called wisdom. Although flow computing is replacing traditional batch processing, there is still a long way to go. Scientists have explored hardware structures and algorithm optimization, and we will not discuss these in detail here.
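The billing example contrasts two styles that are easy to show side by side. The sketch below is an assumption-laden illustration: the tariff, the class name `StreamingBiller`, and the event format are hypothetical, and real flow-computing systems must also handle distribution, ordering, and failures.

```python
from collections import defaultdict

RATE_CENTS_PER_MINUTE = 10  # hypothetical tariff


def batch_bill(stored_calls):
    """Batch style: wait until all call records are stored, then scan them."""
    totals = defaultdict(int)
    for user, minutes in stored_calls:
        totals[user] += minutes * RATE_CENTS_PER_MINUTE
    return dict(totals)


class StreamingBiller:
    """Flow style: update the bill the moment each call ends."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_call_completed(self, user, minutes):
        self.totals[user] += minutes * RATE_CENTS_PER_MINUTE
        return self.totals[user]  # the up-to-date bill is available at once


events = [("alice", 3), ("bob", 1), ("alice", 2)]
biller = StreamingBiller()
for user, minutes in events:
    print(user, "now owes", biller.on_call_completed(user, minutes), "cents")
```

Both styles arrive at the same totals. The difference is when the answer exists: after every event in the flow style, and only after the whole stored file has been scanned in the batch style.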
For the problems of huge data storage and data mining, if the data can be processed immediately to generate analysis, then it is unnecessary to store such a huge amount of data—as long as the log is stored to show, for example, whether the detection system is running normally. People have come up with various strategies for the data that must be stored. The compression software frequently used in the PC era is a representative method of data-volume streamlining. Compression software is an algorithm-based tool that recodes data and stores the encoded compressed data as well as the decoding keys to restore it at any time. People continue to improve file formats; for instance, video files get smaller, but clarity is maintained. People also introduce big-data strategies into storage. For example, if many people upload the same files to the cloud disk, then the system only keeps two backups, and all the files uploaded by others become virtual files linked to the same backup, which greatly saves space.
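The cloud-disk strategy described above is usually called deduplication. A minimal sketch, assuming files are identified by a SHA-256 hash of their content: the class and method names are hypothetical, and this toy keeps a single physical copy, whereas a real service (as the text notes) would keep extra backups for safety.

```python
import hashlib


class DedupStore:
    """Toy cloud disk: identical uploads share one physical copy."""

    def __init__(self):
        self.blobs = {}  # content hash -> actual bytes
        self.files = {}  # (user, filename) -> content hash ("virtual file")

    def upload(self, user, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # store the bytes only once
        self.files[(user, name)] = digest
        return digest

    def download(self, user, name):
        return self.blobs[self.files[(user, name)]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
movie = b"x" * 1000  # stand-in for a large, popular file
for user in ["ann", "ben", "cai"]:
    store.upload(user, "movie.mp4", movie)
print(len(store.files), "virtual files,", store.stored_bytes(), "bytes stored")
```

Three users each see their own file, but the storage cost is that of one copy.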
The increase in the amount of data and computation necessitates a corresponding change in the entire information infrastructure. All the aforementioned data-processing methods were developed on the old information infrastructure. But the development of machine intelligence calls for “physiological” changes to the machine brain.
Green Body
The human brain accounts for only about 2 percent of body weight but uses about 20 percent of the body’s total energy, nearly 20 percent of its daily oxygen consumption, and 75 percent of the blood glucose stored in the liver.
The same is true of what is happening in the field of machine intelligence. Data and algorithms are not physical, as opposed to matter and hardware, and are instead analogous to thought. But the operation of this “thinking” requires huge material resources and energy. In large data centers, in addition to piles of servers, there are power supplies of all sizes, environmental-control devices, monitoring devices, and various security devices that run around the clock like the brain. The organ itself also consumes a lot of energy.
The Internet provides uninterrupted 24/7 service, and the servers consume a lot of energy. According to statistics, in 2011 alone, China’s data centers consumed about 70 billion kilowatt-hours of electricity, accounting for nearly 1.5 percent of the total electricity consumption of the whole nation and equivalent to the annual electricity consumption of Tianjin.
In March 2015, the Ministry of Industry and Information Technology, National Government Offices Administration, and National Energy Administration formulated and promulgated the National Green Data Center Pilot Program, which revealed several figures:
With the rapid development of information technology, the construction of global data centers has clearly sped up, and there are already more than 3 million data centers in the world. Their electricity consumption accounts for about 1.1 to 1.5 percent of the global amount. The problem of high energy consumption has attracted the attention of governments. Currently, the average power usage effectiveness (PUE, the ratio of a data center’s total energy consumption to the energy consumed by its IT equipment) of data centers in the United States has reached 1.9, and the PUE of advanced data centers is less than 1.2. In recent years, China has rapidly built over 400,000 data centers. Their annual electricity consumption exceeds 1.5 percent of the total electricity consumption of the whole nation. The PUE of most data centers is still generally above 2.2, which is a big gap when compared with the international advanced level, and there is huge potential for energy saving. Power-saving, water-saving, low-carbon, and other technological products and advanced management methods are widely used to build green data centers that maximize energy efficiency and minimize environmental impact. The US government has already implemented the Data Center Energy Star and the Federal Data Center Integration Plan, and the European Union has implemented the Code of Conduct of Data Center Energy Efficiency. The International Green Network set the standards for data-center energy efficiency and best practices, which promote the improvement of the energy-conservation and environmental-protection levels of data centers.
What exactly do we do?
Cooling down the equipment room requires constant innovation. Large companies can choose to place data centers in cold regions close to the poles, which has made Iceland an important location for major data events in recent years, such as the WikiLeaks event (in 2010, WikiLeaks released the US military’s classified documents about the war in Afghanistan) and the movie The Bourne Identity. Seawater cooling or air cooling is used to save energy and protect the environment.
Yangquan City, Shanxi Province, is an ancient city at the foot of Taihang Mountain with a long history. Li Yuan, Emperor Gaozu of Tang Dynasty, who attacked the central plains, set up a base camp 200 kilometers west of Yangquan. When Liu Cixin worked at the Niangziguan Power Plant, which is about 50 kilometers northeast of Yangquan, he wrote the world-class science fiction novel Three-Body Problem, which constructed the “sociology of the universe.” Yangquan is a coal region in China, where air pollution is more serious than in Beijing. It has long faced the challenge of industrial upgrading. The tension between history and the future is like the smog that spreads over the mountains and rivers.
In 2015, the Baidu Cloud Computing (Yangquan) Center (hereafter referred to as Yangquan Cloud Computing Center) was put into use. After its completion, the data center’s storage capacity reached more than 4,000 PB (petabytes); the amount of information that can be stored is equivalent to more than two hundred thousand times the total collection of the National Library of China. The total number of CPUs in the data center reached seven hundred thousand, and the total number of CPU cores exceeded three million. High-performance, low power-consumption servers and a number of other technologies applicable to China’s environment and regulations were installed to improve the overall energy efficiency of the data center, giving it a PUE of less than 1.3. This means that for every 1.3 kilowatts of electricity consumed in the computer room, 1 kilowatt is used for data calculation and 0.3 kilowatts for all other purposes, such as heat dissipation, which is a first-class level in Asia in terms of green energy conservation.
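The PUE figures quoted in this section can be compared with a small calculation. PUE is total facility power divided by the power delivered to the computing equipment; the helper names below are hypothetical, and the sample values are the ones mentioned in the text.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total power divided by IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_share(pue_value):
    """Fraction of all electricity spent on cooling, power delivery, etc."""
    return (pue_value - 1) / pue_value


print(pue(1.3, 1.0))  # the Yangquan example: 1.3 kW in, 1 kW for computing
for label, value in [("advanced", 1.2), ("Yangquan", 1.3),
                     ("US average", 1.9), ("typical in China", 2.2)]:
    print(f"{label:>17}: PUE {value}, overhead {overhead_share(value):.0%}")
```

At a PUE of 2.2, more than half of the electricity never reaches a processor; at 1.3, the overhead falls to under a quarter.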
Adhering to the spirit of openness, Baidu cooperated with Tencent, Alibaba, China Mobile, China Telecom, and other related industry leaders to jointly establish China’s first hardware open source project: Project Scorpio, aiming to create an open technical standard and to develop customized, full-rack server solutions to meet the data center’s massive computing and storage needs and effectively reduce data-center procurement and deployment costs.
In September 2014, Project Scorpio upgraded its server technology specification to version 2.0, with a more refined definition of space utilization and cooling strategy and a detailed definition of modules, interfaces, and protocols. Based on these standards, double-digit power-consumption savings were realized through the integrated design of various resources in the Yangquan Cloud Computing Center. Through this plan, we see the rapid iteration of the Internet, in which products and services quickly adapt to changing needs and new versions are constantly launched to meet or lead those needs, always faster than the competitors. The brand-new 3.0 specification places more emphasis on modularity, while the details of the specification are more comprehensive and enforceable.
In May 2015, the solar photovoltaic power-generation project of Yangquan Cloud Computing Center was successfully connected to the power grid. That was the first application of solar photovoltaic technology in a Chinese data center; it reduces carbon dioxide emissions by 107.76 tons per year and saves up to 43 percent of energy.
Computer Architecture Innovation
Although these steps are important, energy-saving and emission-reducing methods are an external change, like blowing cold air from an energy-saving air conditioner on someone with a fever; we need internal innovation in computers. Just as batch processing was the product of the old era, the existing server and data center architecture is also built on old-world computer technology, which is half old, half new.
The traditional computer core architecture is based on the von Neumann structure: separation of data storage and processing, and linearly distributed computational logic. The computing chip executes the instruction code and stores the result in memory for the next calculation instruction to call. Such a structure is very clear for humans, but the speed is greatly affected. Moreover, in such a linear flow, the arbitrary instructions executed by the CPU require the instruction memory, decoder, arithmetic unit, and branch jump processor to work together, with the work assigned according to the order in which the instructions are executed. The logic of the control-instruction stream is complicated; it is difficult to have many independent instruction streams, and the parallel-processing capability is low.
Moore’s Law is now out of date. At present, computer memory speed is increasing by only 9 percent a year, and hard-disk performance by only 6 percent. The running speed of computer memory is only a few hundredths of the CPU speed, which is a bottleneck. Data-storage throughput has become a serious drag on computer performance.
In the early days, someone proposed a computer with a changeable architecture. Take a personal computer as an example: it is designed for general tasks. Even when a simple task such as typing is performed, the entire computer system is busy, and all the other resources are therefore wasted. A computer that can change its architecture can call on different parts of the machine in a controlled manner for tasks with different levels of complexity, without calling on all the resources for big and small tasks alike. Truly realistic computer innovations have found a way forward in the development of new technologies.
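The stored-program cycle described above can be made concrete with a toy machine. This is a deliberately tiny sketch: the four-instruction set and the memory layout are hypothetical and do not correspond to any real CPU.

```python
def run(memory, max_steps=1000):
    """Run a program held in `memory`; return (accumulator, memory accesses)."""
    pc, acc, accesses = 0, 0, 0
    for _ in range(max_steps):
        op, arg = memory[pc]          # fetch the next instruction from memory
        accesses += 1
        pc += 1
        if op == "LOAD":              # read an operand from memory
            acc = memory[arg]
            accesses += 1
        elif op == "ADD":
            acc += memory[arg]
            accesses += 1
        elif op == "STORE":           # write the result back to memory
            memory[arg] = acc
            accesses += 1
        elif op == "HALT":
            return acc, accesses
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("program did not halt")


# Instructions and data share the same memory: cells 0-3 hold code, 4-6 data.
memory = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", None), 2, 3, 0]
result, accesses = run(memory)
print(result, "computed with", accesses, "memory accesses")
```

Adding two numbers takes seven trips to memory, one instruction at a time. That single shared path between processor and memory is why memory speed, rather than arithmetic speed, so often sets the pace.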
One direction is leading-edge physics, such as the fascinating quantum computing, which uses the quantum-state superposition effects in quantum physics to create a million times the performance of today’s computer chips. Replacing current transfer data and operations with an optical flow is also a direction to increase speed. Another direction has been learned through the rise of brain science and deep learning: it is hoped that imitating the human brain to develop neurological chips will lead to computer speed being orders of magnitude faster than existing computers.
People are trying different ways. It is an unprecedented step for deep-learning scientists to use GPUs instead of CPU groups for machine-learning technology. The GPU uses SIMD (single instruction, multiple data) to allow multiple execution units to process different data at the same time. Originally used to process image data, it is also particularly well suited for dealing with nonlinear discrete data that deep-learning tasks often encounter. Baidu uses large-scale GPU clusters to optimize its engineering and developed its own GPU server, which greatly improved hardware performance. But the GPU is also built on the von Neumann structure.
FPGA chips are another popular development. They originated as a solution in the field of application-specific integrated circuits (ASICs). An ASIC is an integrated circuit for a specific user or a particular electronic system. In the past, digital integrated circuits greatly reduced the cost of electronic products, thanks to their versatility and scale production. But at the same time, a contradiction between general and special use and a disconnection between system design and circuit production arose. The larger the integrated circuit, the harder it is to accommodate specific requirements when building a system. To solve these problems, a kind of application-specific integrated circuit emerged that allows users to participate in designing its features, namely the FPGA.
The design of complex parallel circuits was applied to computing chips. The FPGA computing chip is covered with a “logic cell array” and includes three parts: configurable logic modules, input and output modules, and internal connections. The logic modules are independent basic-logic units that implement both combinational logic functions and sequential logic functions, with their respective logic and their relationships with each other defined in a hardware description language. In the von Neumann structure, memory has two main functions: to preserve intermediate calculation results and to carry communication between units. Since the memory is shared, when multiple instructions need to access it, arbitration must grant access one at a time. By contrast, the registers and on-chip memory (BRAM, or block RAM, a fast and small internal memory) in the FPGA have their own control logic, without any unnecessary arbitration and buffering. The connections between each logic unit of the FPGA and the surrounding logic units are programmable, so they can be determined in advance, without communication through shared memory. Parallel computing is the main mode of operation, and multiple instruction streams and multiple data streams can be processed at the same time, which greatly saves computation time. An FPGA can also be specially programmed at the hardware level for different application scenarios, with high flexibility.
Baidu began deploying FPGAs in 2012. It was the first company to introduce FPGAs in China and also one of the first companies in the world to use FPGAs in clusters. Ya-Qin Zhang said that in the beginning it was the CPU, and then the GPU was used. Basically, all artificial intelligence companies use GPUs. But the FPGA has its own advantages. One of many is the improved speed and efficiency of the entire architecture. The GPU performs better on image and voice data, but the FPGA is faster in many types of general-purpose computing. The programmable FPGA allows the architecture to be changed quickly. The FPGA used by Baidu is currently five to six times more efficient than GPU and CPU architectures and can provide acceleration directly without changing the existing architecture.
From a computational point of view, network transmission is often considered as the most important bottleneck. Baidu has invested in the most advanced technology for the entire network communication, using 100G RDMA14 to communicate between GPU and FPGA. So data can be transferred quickly and flawlessly between clusters and databases.
FPGA is equivalent to programming software with hardware, and it is difficult to implement complex algorithms. Currently it works with GPU and CPU architecture.
Since probability calculation is a mathematical method commonly used by big data and artificial intelligence, some inspired people proposed the concept of the probability chip. The probabilistic algorithm is used to replace the previous calculus algorithm, which exchanges calculation precision for a great improvement on calculation speed and energy-consumption reduction. It is suitable for such applications that do not pursue extreme precision, such as the Internet of things.
With the rise of deep learning, chip scientists have been greatly inspired. The most cutting-edge chip innovation is the artificial neural-network chip, based on the principles of deep learning. Intel, IBM, NVIDIA, and all the other major companies have set their own chip-development directions. The deep-learning chips launched by Chinese companies, headed by Cambricon Technologies, already lead the world in this category.
The artificial neural network is a general term for computational architectures that mimic the biological neural network. It consists of many artificial neuron nodes interconnected by synapses. Here, each neuron is actually an excitation (activation) function, and the synapses are weights that record how strong or weak the connections between neurons are.
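The description of a neuron as a function fed by weighted synapses can be sketched in a few lines. This is a minimal illustration: the sigmoid is one common choice of excitation function, and the input values, weights, and bias are made up for the example.

```python
import math

def sigmoid(z):
    """A common excitation (activation) function that squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs, passed through the activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# One strong, one weak, and one negative (inhibitory) synapse.
output = neuron(inputs=[1.0, 0.5, 1.0], weights=[2.0, 0.1, -1.5], bias=0.0)
print(round(output, 3))
```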
The neural network is multilayered, and the input of a neuron (function) is determined by the outputs of the previous neurons connected to it and the weights of the connecting synapses. The so-called training of a neural network means adjusting the output by feeding in a large amount of data under supervision. This process continuously adjusts the synaptic weights between neurons until the output becomes stable and correct. Subsequently, when new data is entered, the output can be calculated according to the current synaptic weights, which is how the neural network “learns” from existing information. That is to say, storage and processing are integrated in the neural network, and the intermediate calculation results become the weights of the synapses.
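The training loop described here can be sketched with a single neuron that learns the logical OR function from labeled examples. This is a minimal sketch, not a full multilayer network: gradient descent on a cross-entropy loss is one standard way to adjust the weights, and the learning rate and number of passes are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train(samples, epochs=2000, rate=0.5):
    """Repeatedly nudge the synaptic weights to reduce the output error."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            # Supervision signal: how far the output is from the label.
            error = predict(weights, bias, inputs) - label
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias

# Labeled data for logical OR.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(samples)
outputs = [round(predict(weights, bias, x)) for x, _ in samples]
print(outputs)
```

After training, nothing is stored except the weights and the bias, which is the sense in which storage and processing are integrated.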
Traditional processors (including x86 and ARM chips) are subject to the von Neumann structure and are inefficient when dealing with deep-learning neural-network tasks. Their storage and processing are separated, and their basic operations are arithmetic (addition, subtraction, multiplication, and division) and logical (AND, OR, and NOT). These chips often require hundreds or even thousands of instructions to complete the processing of one neuron, and that is why AlphaGo requires so many chips (the distributed versions have 1,202 CPUs and 176 GPUs).
Chips specially designed for deep learning are different. Take the example of DianNaoYu, the instruction set developed by Cambricon. The instruction set directly processes large-scale neurons and synapses. One instruction can complete the processing of a group of neurons, and the instruction set provides a series of specialized support for the transmission of neuron and synapse data on the chip. At the current technical level, the average performance of a single-core processor is more than one hundred times that of the mainstream CPU, with only about 10 percent of the area and power consumption, so the overall performance can be improved by three orders of magnitude.
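A back-of-the-envelope count shows where the "hundreds or even thousands of instructions" per neuron come from. The sketch below assumes one multiply and one add per synapse plus one activation per neuron, and it ignores loads, stores, and loop overhead. The group-at-a-time instruction and its group size of 16 are hypothetical stand-ins for illustration, not a description of the actual DianNaoYu instruction set.

```python
def scalar_instructions(num_inputs, num_neurons):
    """Count scalar operations for one fully connected layer.

    Assumes one multiply and one add per synapse plus one activation
    per neuron; loads, stores, and loop overhead are ignored.
    """
    per_neuron = 2 * num_inputs + 1
    return per_neuron * num_neurons

def group_instructions(num_neurons, neurons_per_instruction=16):
    """Hypothetical accelerator: one instruction handles a group of neurons."""
    return -(-num_neurons // neurons_per_instruction)  # ceiling division

print(scalar_instructions(256, 1))    # 513 operations for a single neuron
print(scalar_instructions(256, 256))  # 131328 operations for the whole layer
print(group_instructions(256))        # 16 group instructions
```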
Of course, the neural network chip only has advantages over traditional CPUs on artificial-intelligence tasks and is suitable for image recognition, speech recognition, and similar work, while traditional chips are better at tasks such as running databases, Office, and WeChat, unless such tasks were to undergo a structural revolution.
As the most basic method of artificial intelligence, the combination of deep learning and neural networks will determine the progress of artificial intelligence. This technique, which mimics the mechanism of the human brain, exhibits characteristics similar to biological evolution.
Competition or cooperation? People often struggle with this conceptual problem, but the philosophy of nature is focusing more on cooperation. In recent years, scholars in different fields have proposed the concept of coevolution. The relationship between the Danaus plexippus (also known as the monarch butterfly) and the plant milkweed is a typical example.
Milkweed juice is poisonous, and the plant’s closed structure makes it difficult to spread pollen through the wind, but the plant can attract butterflies with its nectar and pollinate through them. Monarch butterfly larvae feed on the young stems and leaves of the milkweed and store its toxins in their bodies to defend against enemies. If the larvae eat too many stems and leaves, the milkweed will die, so some of the milkweed will mutate toward a more closed structure that can hinder the butterfly from entering. But some monarch butterflies will in turn develop an enhanced ability to get into the stamens of the mutated milkweed. So the two become more and more inextricably entangled: the monarch butterfly does not eat other plants and milkweed does not welcome other insects, so no third party can join their game. Viruses and antivirus software, and hacking and antihacking procedures, are examples of coevolution on the Internet. Machine learning has now been applied to network security, and its efficiency is greatly improved compared with past firewalls based on fixed features. Coevolution is neither a life-or-death struggle nor a respite, but a continual upgrade in methods of confrontation.
Artificial intelligence is also coevolving. In the ever-changing neural networks, this process is vividly revealed. Two of the new neural network ideas are introduced as follows.
Generative Adversarial Networks
Supervised deep learning means that the input data has semantic labels and the output results are marked by human beings. But many scientists believe that unsupervised learning, which allows the machine to find patterns in the raw data by itself, is the future direction of development. There are already many different approaches, with reinforcement learning being one of the directions, and generative adversarial networks are already in use.
Ian Goodfellow, the inventor of generative adversarial networks, was a student of Yoshua Bengio and now works at Apple as the director of machine learning. Yann LeCun, a well-known deep-learning expert, praised generative adversarial networks. This kind of network reflects well the entangled, mutually evolving character of “evolution.”
Generative adversarial networks are derived from the concept of adversarial examples, which was first introduced by Christian Szegedy and others in a paper published at ICLR 2014 (International Conference on Learning Representations). Subtle interference is deliberately added to the input data to form an input sample that causes the deep neural network to produce an erroneous output. The error is obvious to the human eye, but the machine repeatedly falls into the trap.
Ian Goodfellow, Jonathon Shlens, and Christian Szegedy give a typical example in the paper “Explaining and Harnessing Adversarial Examples.”
People add tiny interference to a picture of a panda. Modifications are performed in 32-bit floating-point values without affecting the 8-bit representation of the image.
The human eye cannot see the difference at all, but the neural network surprisingly judges the picture to be a gibbon with a 99.3 percent confidence level. Because adversarial samples lead to recognition errors, some people regard them as deep learning’s deep flaws. But Zachary Chase Lipton, formerly from the University of California, San Diego, and currently with Carnegie Mellon University, published an article at KDnuggets (a big-data media site in the United States) with the title “(Deep Learning’s Deep Flaws)’s Deep Flaws.”15 This article argues that vulnerability to adversarial samples is not unique to deep learning and that it prevails in many machine-learning models. Further research on algorithms that resist adversarial samples will facilitate the advancement of the entire machine-learning field.
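The paper cited above builds its panda example with the fast gradient sign method, which shifts every input value by a small step in the direction that increases the model’s loss. The sketch below applies the same idea to a tiny logistic classifier instead of an image network. The weights, the input, and the step size are made up for illustration, and the gradient formula in the comment holds for this simple model only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(weights, bias, x, label, epsilon):
    """Shift each feature by epsilon in the direction that raises the loss.

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input is (p - label) * weights.
    """
    p = predict(weights, bias, x)
    gradient = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]

# Made-up model and input with 100 features; the true label is 1.
weights = [0.5 if i % 2 == 0 else -0.5 for i in range(100)]
bias = 0.0
x = [0.02 if i % 2 == 0 else -0.02 for i in range(100)]

x_adv = fgsm(weights, bias, x, label=1, epsilon=0.05)
print(predict(weights, bias, x), predict(weights, bias, x_adv))
```

No single feature moves by more than 0.05, yet the many small shifts add up across 100 features and push the prediction to the other side of 0.5.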
Scientists have grasped the fragility of “evolution” and see nature as making the best from a mistake, regarding confrontation as a training method that turns all the obstacles into motivation to advance through difficulties. The evolution of nature itself is also highly fragile, and countless biological “programs” are eliminated from nature because they are “faulty.” Error is the ultimate tool of evolution. Wisdom is rising with difficulty in the process of endless birth and death.
The generative adversarial network is a neural-network design in which the network actively generates interference data and uses it for training. Put simply, the generative adversarial network consists of two parts: one is the generator and the other is the discriminator. The generator is like a profiteer who sells deceptive fake goods, and the discriminator is like a superb buyer who needs to identify the authenticity of the goods.
The job of the profiteer is to find ways to deceive the buyer (generating adversarial samples), while the buyer learns through experience and reduces the probability of being cheated. Both profiteer and buyer constantly strive to achieve their own goals, while at the same time each advances under the supervision of the other, like a Blue Army and a Red Army confronting each other fiercely in military exercises, which strengthens the fighting skills of both sides without any gunsmoke.
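The opposing goals of the profiteer and the buyer can be written as two loss functions over the buyer’s scores, where each score is the probability the buyer assigns to an item being genuine. This is a minimal sketch of the standard objectives only, with no networks and no training loop. The generator loss uses the commonly used non-saturating form, and the sample scores are made up.

```python
import math

def discriminator_loss(d_real, d_fake):
    """The buyer's loss: low when real goods score near 1 and fakes near 0."""
    real_term = -sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The profiteer's loss: low when the buyer scores fakes near 1."""
    return -sum(math.log(d) for d in d_fake) / len(d_fake)

# A sharp buyer versus a buyer who can no longer tell real from fake.
sharp_buyer = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
fooled_buyer = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(sharp_buyer, fooled_buyer)
print(generator_loss([0.1, 0.05]), generator_loss([0.5, 0.5]))
```

When the buyer is reduced to guessing, every score is 0.5 and the buyer’s loss equals 2 ln 2, the balance point at which neither side can improve by itself.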
The generative adversarial network and the response to it demonstrate coevolution. It is a profound philosophy of evolution, an entanglement instead of a war, which maintains a precarious balance.
Do we want the mature buyer or the superb profiteer? The answer is both; they are inevitable elements of coevolution.
What is the use of the profiteer model? In many cases we lack data, and the gap can be filled by generative models. Samples produced without supervision can then yield effects similar to those of supervised learning.
Wei Li and Roderich Groß from the University of Sheffield, UK, and Melvin Gauci from Harvard University, US, developed a new Turing learning method for studying group behavior, based on the generative adversarial network.16 A group of fish is mixed with some fake fish that mimic real fish movement. How does one judge the fidelity of the imitation behavior? It is difficult to distinguish between movement behaviors with traditional feature induction methods, and the motion characteristics of the same group of fish are not necessarily similar each time. The team decided to let the machine automatically build a group model by emulating both sets, allowing the machine to infer the behavior of natural objects and imitations. This learning method simultaneously optimizes two groups of computer programs, one representing the behavior of the model and the other representing the classifier. The model can mimic the behavior of supervised learning as well as the behavior between the system and other models.
To be more specific, they established three groups of robots. The first group is the imitated objects, which perform complex movements according to prespecified rules. The second is the imitators, which are mixed into the first group and try to learn and imitate the first group’s behavior and to deceive the discriminator. The third is the discriminators, whose task is to distinguish the imitators from the imitated in the group movement. As the discriminators became increasingly discerning, the imitation behavior became increasingly hard to tell apart from the original. Therefore, we can use the trained imitators to build a realistic multiagent model to simulate the group of imitators. This model can be used to study collective movement behavior. For example, a model can be trained on camera recordings of crowd movements at popular holiday spots to improve the prediction of crowd-movement trends and to issue early warnings of possible congestion and stampedes.
The evolutionary iteration of machines is a zillion times faster than nature. In this kind of adversarial generation, the logic the machine acquired has gone far beyond human understanding and may become a “black box.” It is a big challenge to choose between the black box and white box and to avoid the incomprehensible danger of the black box.
Dual Network
The dual network seems to be a mirror image of the generative adversarial network.
At present, most neural-network training relies on tagged data, that is, supervised learning. But labeling data is an onerous task. According to reports, Open Images, Google’s open-source image data set, contains about nine million images; the YouTube-8M data set contains eight million segments of labeled video; and ImageNet, the earliest of these image data sets, currently contains more than fourteen million classified images. Labeling these three data sets reportedly took fifty thousand workers recruited through Amazon Mechanical Turk, an Amazon labor-outsourcing platform, and two years to complete most of the well-labeled data.
Enabling the machine to work in the absence of labeled data is the future direction of artificial intelligence. In 2016, Dr. Tao Qin and others from Microsoft Research Asia presented a new machine-learning paradigm, dual learning, in a paper submitted to NIPS (Conference on Neural Information Processing Systems) 2016. The general idea is that many applications of artificial intelligence involve two dual tasks. For example, translation from Chinese to English and translation from English to Chinese are dual; speech recognition and speech synthesis are dual in speech processing; generating text from images and generating images from text are dual in image understanding; answering questions and generating questions are dual in question-and-answer systems; and so are searching for related web pages by keywords and generating keywords for pages. These dual artificial-intelligence tasks can form a closed loop and make it possible to learn from unlabeled data. The key to dual learning is that the model of the second task can provide feedback to the model of the original task; similarly, the model of the original task can provide feedback to the model of the second task. Thus, the dual tasks can provide feedback to each other, as well as learn from and improve each other.
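The closed loop can be sketched with two toy word tables standing in for the two translation models. The sketch shows only where the feedback signal comes from: an unlabeled input is sent through one model and back through the other, and the fraction recovered serves as a reward. The published method trains neural models and is considerably more involved; the tables, names, and scoring rule here are hypothetical.

```python
def round_trip_reward(primal, dual, unlabeled):
    """Fraction of unlabeled inputs recovered after primal then dual mapping."""
    recovered = sum(1 for item in unlabeled if dual.get(primal.get(item)) == item)
    return recovered / len(unlabeled)

# Toy stand-ins for two translation models (hypothetical word tables).
dual_en_to_zh = {"cat": "猫", "dog": "狗", "fish": "鱼"}
candidate_a = {"猫": "cat", "狗": "dog", "鱼": "fish"}
candidate_b = {"猫": "dog", "狗": "cat", "鱼": "fish"}

unlabeled = ["猫", "狗", "鱼"]  # no reference translations are needed
reward_a = round_trip_reward(candidate_a, dual_en_to_zh, unlabeled)
reward_b = round_trip_reward(candidate_b, dual_en_to_zh, unlabeled)
print(reward_a, reward_b)
```

The better Chinese-to-English candidate earns the higher reward even though no labeled sentence pair was consulted, which is the sense in which the second task gives feedback to the first.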
The use of such a subtle strategy by dual networks greatly reduces the reliance on annotated data. Here we can once again find some evolutionary philosophy: evolution is a cyclic process of response and receipt, from A to B and from B to A. The two are mirror images of each other, but the mirrors are not clear. Each holds half of the secret, and with no arbiter between them they move forward by guessing at and referring to each other.
The foregoing two neural-network methods are only typical manifestations of constantly emerging new methods. In addition to deep neural-network methods, scientists are actively exploring other paths. Professor Zhou Zhihua, a famous machine-learning expert at Nanjing University, presented a creative algorithm with coauthor Feng Ji in a paper published on February 28, 2017, which can be called the “gcForest” algorithm. As the name hints, this algorithm uses the traditional decision-tree algorithm but emphasizes the tree hierarchy, as opposed to deep learning, which basically emphasizes the number of layers of the neural network. The multilevel decision trees form a “forest.” Through some sophisticated algorithm settings, with small data sizes and limited computing resources, and in applications such as image, sound, and emotion recognition, its results are at least equal to those of the neural network. This new method is insensitive to parameter settings, and the logic-based tree approach is easier to analyze theoretically than deep neural networks, thus avoiding the black box problem, in which humans have difficulty understanding the machine’s specific operational logic.
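The cascade idea behind the tree hierarchy can be sketched as follows: each level holds several forests, each forest outputs a class-probability vector, and the next level receives the original features concatenated with those vectors. This is a structural sketch only, based on the cascade described in the gcForest paper. Training and the multi-grained scanning stage are omitted, and the two "forests" are hypothetical hand-written stand-ins rather than trained models.

```python
def cascade_predict(levels, features):
    """Pass features through cascade levels of forest-like models.

    Each model returns a class-probability vector. Every level after the
    first sees the original features concatenated with the vectors
    produced by the previous level.
    """
    augmented = list(features)
    class_vectors = []
    for level in levels:
        class_vectors = [model(augmented) for model in level]
        augmented = list(features) + [p for vec in class_vectors for p in vec]
    # Final prediction: average the last level's vectors, pick the largest.
    num_classes = len(class_vectors[0])
    averaged = [sum(vec[c] for vec in class_vectors) / len(class_vectors)
                for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: averaged[c]), averaged

# Hypothetical stand-ins for trained forests (two classes).
def forest_a(x):
    p = min(max(sum(x) / len(x), 0.0), 1.0)
    return [1.0 - p, p]

def forest_b(x):
    p = min(max(x[0], 0.0), 1.0)
    return [1.0 - p, p]

levels = [[forest_a, forest_b], [forest_a, forest_b]]
label, probs = cascade_predict(levels, [0.8, 0.6])
print(label, probs)
```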
Method | Accuracy
Multi-granularity cascaded forest | 65.67%
Convolutional neural network | 59.20%
Multiple layer neural network | 58.00%
Random forest | 50.33%
Logistic regression | 50.00%
Support vector machine based on RBF kernel function | 18.33%
Source: https://arxiv.org/pdf/1702.08835.pdf
According to the think tank AI Era, Professor Zhou Zhihua believes that the methodological significance of deep forest is to explore the possibility of algorithms outside the deep neural network. Deep neural networks need large amounts of data and computing power to operate effectively, and deep forests are likely to offer a new option. Of course, deep forests still draw key ideas from deep neural networks, such as the ability to extract features and build models. Therefore, it is still a novel branch of deep learning. Chinese scientists have many world-leading achievements in artificial-intelligence research. We believe that self-confidence and an open mind will be an important driving force for scientific progress.
Today, major artificial-intelligence technology companies advocate sharing algorithmic code; Google’s TensorFlow deep-learning open-source platform is the most influential example. However, many deep-learning scientists believe that, from an economic point of view, parallel competition among more deep-learning code platforms will be conducive to prosperity and a counterbalance to monopoly. In addition to deep-learning open-source platforms such as Caffe and MXNet from other companies, Baidu released PaddlePaddle, a deep-learning open-source platform, in September 2016. With its new architecture, it has good support for sequence input, sparse input, and large-scale data model training. It supports GPU computing, data parallelism, and model parallelism, and it allows deep-learning models to be trained with a small amount of code, which greatly reduces the cost of using deep-learning technology. A diverse set of shared platforms enables machine learners to train and create applications from different perspectives, a kind of biodiversity that contributes to the advancement of artificial intelligence.
Only in the remote future, if ever, could artificial intelligence really become powerful enough to rule the world. All challenges thus far have been related to human wisdom. But the flashes of wisdom from artificial-intelligence scientists illuminate the direction for latecomers. Even those who do not work in artificial intelligence may draw a great deal of strategy and inspiration from them.
At the beginning of 2017, Master, a version of AlphaGo, swept the top players from China and Korea. For a time, people have been divided into different categories: pessimists, adventurers, calm people, outrageous people. We hope that more people will be in the silently learning category.